Abstract

Combined compression and classification problems are becoming increasingly important in many 
applications with large amounts of sensory data and large sets of classes. These applications range 
from aided target recognition (ATR), to medical diagnosis, to speech recognition, to fault detection 
and identification in manufacturing systems. In this paper, we develop and analyze a learning vector 
quantization (LVQ) based algorithm for the combined compression and classification problem. We show 
convergence of the algorithm using techniques from stochastic approximation, namely, the ODE method. 

We illustrate the performance of our algorithm with some examples. 

Index Terms- Learning vector quantization, classification, stochastic approximation, compression, 
non-parametric 

1 Introduction 

Quite often in applications, we are faced with the problem of classifying signals (or objects) from vast 
amounts of noisy data. Equally often, the number of different distinct signals (classes) that we have in 
the problem may be quite large. If we could compress each observation (observed signal) significantly
without distorting or annihilating the most significant features used for classification, we could achieve
significant advantages in two directions:

(i) We can reduce significantly the memory required for storing both the on-line and class model data; 

(ii) We can increase significantly the speed of searching and matching that is essential in any classification 
problem. 

Furthermore, performing classification on compressed data can result in better classification, due to
the fact that compression (done correctly) can reduce the noise more than the signal [1]. For all these
reasons, it is important to develop methods and algorithms to perform classification of compressed data,
or to analyze jointly the problem of compression and classification. This area has recently attracted more
interest due to the increased number of applications requiring such algorithms. In [2] and [3], vector
quantization methods have been used for minimizing both the distortion of compressed images and errors
in classifying their pixel blocks.

*Research supported by ONR contract 01-5-28834 under the MURI Center for Auditory and Acoustics Research, by NSF grant
01-5-23422 and by the Lockheed Martin Chair in Systems Engineering.

There is yet another significant advantage in investigating the problem of combined compression 
and classification. If such a framework is developed, we can then analyze progressive classification 
schemes, which offer significant advantages for both memory savings and for speeding up searching 
and matching. Progressive classification uses very compressed representations of the signals at first to 
perform many simple (and therefore fast) matching tests, and then progressively perform fewer but more 
complex (and therefore slower) matching tests, as needed for classification. Thus, compression becomes 
an indispensable component in such schemes, and in particular variable rate (and therefore resolution) 
compression. In the last four years, we have analyzed such progressive classification schemes on a 
variety of problems with substantial success. The structure of the algorithms we have developed has 
remained fairly stable, regardless of the particular application. This structure consists of a multiresolution 
preprocessor followed by a tree-structured classifier as the postprocessor. Sometimes a nonlinear feature 
extraction component needs to be placed between these two components. Often the postprocessor 
incorporates learning. 

To date, we have utilized wavelets as the multiresolution preprocessor and tree-structured vector
quantization (TSVQ) as the clustering postprocessor. We have applied the resulting WTSVQ algorithm
to various ATR problems based on radar [4] [5] [6], ISAR and face recognition problems [7]. We have 
established similar results on ATR based on FLIR using polygonization of object silhouettes [8] [9] as the 
multiresolution preprocessor. Incorporation of compression into these algorithms is part of our current 
research. 

As a first step towards developing a progressive classification scheme with compression, we need 
to develop an algorithm for combined compression and classification at a fixed resolution. As opposed 
to the algorithm described in [3] that achieves this with a posteriori estimation of the probability models
underlying the different classes of signals, our goal is to develop an algorithm that is nonparametric, in 
the sense that it does not use estimates of probability distributions of the underlying sources generating 
the data. In this paper, we achieve that goal by using a variation of Learning Vector Quantization (LVQ), 
that cleverly takes into account the distortion present. Note that LVQ as described in [10], although 
designed to perform classification, automatically achieves some compression as a byproduct since it is 
inherently a vector quantization algorithm. However, our algorithm is designed to obtain a systematic 
trade-off between its compression and classification performances by minimizing a linear combination 
of the compression error (measured by average distortion) and classification error (measured by Bayes 
risk) with a variation of LVQ based on a stochastic approximation scheme. The convergence analysis of 
this algorithm essentially follows techniques similar to those presented in [11] and used in [12]. However,
our treatment is considerably simpler since, to start with, we recognize that the algorithm is a special case
of the Robbins-Monro algorithm.

In Section 2, we describe the LVQ-based algorithm for combined compression and classification. In 
Sections 2.1 and 2.2, we provide analysis and convergence of the algorithm using stochastic approximation 
techniques and the so-called ODE method. In Section 3, we provide simulation results of the performance 
of the algorithm for some typical problems. Section 4 presents some concluding remarks. 

2 Combined compression and classification with learning vector quantization

Learning vector quantization (LVQ), introduced in [13], is a nonparametric method of pattern classification.
As opposed to the parametric methods, this method does not attempt to obtain a posteriori estimates
of the underlying probability models of the different patterns that generate the data to be classified. It 
simply uses a set of training data for which the classes are known in a supervised learning algorithm to 
divide the data space into a number of Voronoi cells represented by the corresponding Voronoi vectors and 
their associated class decisions. Using the training vectors, these Voronoi vectors are updated iteratively 
until they converge. The algorithm involves three main steps: 

1. Find out which Voronoi cell a given training vector belongs to by the nearest-neighbor rule. 

2. If the decision of the training vector coincides with that of the Voronoi vector of this particular cell, 
move the Voronoi vector towards the training vector, else, move it away from the training vector. 
All the other Voronoi vectors are not changed. 

3. Obtain the next training vector and perform the first two steps. 

This process is usually carried out in multiple passes of the finite set of the training vectors. A 
detailed description of this algorithm with a preliminary analysis of its convergence properties using 
stochastic approximation techniques of [11] has been given in [12]. It has also been indicated in [12] that
as the number of training vectors goes to infinity, the classification error achieved by the LVQ algorithm 
approaches the optimal Bayes’ error. Although its primary goal is to classify the data into different 
patterns, the LVQ algorithm compresses the data in the process into a codebook of the size equal to the 
number of the Voronoi cells where each Voronoi vector represents the code for all the vectors belonging 
to that cell. 
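
As a rough illustration of the three steps above, the following Python sketch performs single basic LVQ updates with the squared Euclidean distance. The function names, the constant step size and the toy codebook are illustrative assumptions and are not taken from [12] or [13].

```python
def nearest(codebook, x):
    """Step 1: index of the Voronoi vector closest to x (nearest-neighbor rule)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))


def lvq_step(codebook, labels, x, x_label, step):
    """Step 2: attract the winning vector if the decisions agree, repel it otherwise."""
    i = nearest(codebook, x)
    sign = 1.0 if labels[i] == x_label else -1.0
    codebook[i] = [c + sign * step * (v - c) for c, v in zip(codebook[i], x)]
    return i


# Toy codebook with one Voronoi vector per class (hypothetical values).
codebook = [[0.0, 0.0], [2.0, 2.0]]
labels = [1, 2]
lvq_step(codebook, labels, [0.5, 0.5], 1, 0.1)  # decisions agree: move towards x
lvq_step(codebook, labels, [1.8, 1.8], 1, 0.1)  # decisions differ: move away from x
```

Step 3 of the list amounts to calling `lvq_step` once per training vector, possibly over several passes.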

In what follows, we present a simple variation of the LVQ algorithm in [12], that achieves a task 
of combined compression and classification. We present a convergence analysis of this algorithm much 
along the lines of [12]. However, we present a simpler analysis by recognizing that the algorithm is a 
special case of the Robbins-Monro algorithm. Also, simulation results show that as the weight parameter
$\lambda$ is increased, the compression error gradually decreases compared to the error achieved by the standard
LVQ (which corresponds to $\lambda = 0$).

In the next subsection, we introduce the notations and describe the algorithm. 
Algorithm for combined compression and classification

Consider a complete probability space $(\Omega, \mathcal{F}, P)$. Let $X_l \in \mathbb{R}^d$, $l = 1, 2, \ldots, N$, represent the training
vectors defined on this space, generated by either of the two patterns 1 or 2. The a priori probabilities of
the two patterns are $\pi_1$ and $\pi_2$ respectively, and the corresponding pattern densities are $p_1(x)$ and $p_2(x)$
respectively, such that

$$P(X_l \in B) = \pi_1 \int_B p_1(x)\,dx + \pi_2 \int_B p_2(x)\,dx \tag{1}$$

We also assume that $X_l$ is independent of $X_j$, $j \neq l$.

The Voronoi vectors are represented by $\theta_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, K$, and the corresponding Voronoi
cells are represented by $V_{\theta_i}$. Let the decision associated with the training vector $X_l$ be represented by
$d_{X_l}$ and that of the cell $V_{\theta_i}$ by $d_{\theta_i}$, where $d_{X_l}, d_{\theta_i} \in \{1, 2\}$.

Consider a non-increasing sequence of positive real numbers $\epsilon_n$, $n = 1, 2, \ldots$, such that

Assumption 2.1 $\sum_{n=1}^{\infty} \epsilon_n = \infty$

Consider also a distance function $\rho(\theta, x)$ which satisfies the following assumptions:

Assumption 2.2 $\rho(\theta, x)$ is a twice continuously differentiable function of $\theta$ and $x$, and for every fixed
$x \in \mathbb{R}^d$, it is a convex function of $\theta$.

Assumption 2.3 For any fixed $x$, if $\theta(k) \to \infty$ as $k \to \infty$, then $\rho(\theta(k), x) \to \infty$.

Assumption 2.4 For every compact $Q \subset \mathbb{R}^d$, there exist constants $C_1$ and $q_1$ such that for all $\theta \in Q$,

$$|\nabla_\theta \rho(\theta, x)| \le C_1 (1 + |x|^{q_1}) \tag{2}$$


An example of a function which satisfies the properties above is $\rho(\theta, x) = \|\theta - x\|^2$, where $\|\theta - x\|$ is the
Euclidean distance between the two vectors. In our implementation of the algorithm, we use this distance
function, although for the sake of generality in the analysis, we refer to it in its general form $\rho(\theta, x)$.
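
As a brief check, offered as a sketch rather than as part of the original analysis, the squared Euclidean distance can be verified against Assumptions 2.2 and 2.4 as follows. The symbol $R$ (the radius of the compact set $Q$) is introduced here only for illustration.

```latex
\begin{align*}
\nabla_\theta \rho(\theta, x) &= 2(\theta - x), \qquad
\nabla_\theta^2 \rho(\theta, x) = 2I \succeq 0 \quad \text{(convexity in } \theta\text{)}, \\
|\nabla_\theta \rho(\theta, x)| &\le 2|\theta| + 2|x|
  \le 2\max(1, R)\,(1 + |x|), \qquad R = \sup_{\theta \in Q} |\theta| .
\end{align*}
```

Under this reading, the gradient bound holds with $C_1 = 2\max(1, R)$ and $q_1 = 1$, and $\rho(\theta(k), x) \to \infty$ whenever $|\theta(k)| \to \infty$ for fixed $x$.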
Define further the following quantities: 

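Definition 2.1

$$\gamma(d_{X_{n+1}}, d_{\theta_i(n)}, X_{n+1}, \Theta(n)) = -1_{\{X_{n+1} \in V_{\theta_i(n)}\}} \left( 1_{\{d_{X_{n+1}} = d_{\theta_i(n)}\}} - 1_{\{d_{X_{n+1}} \neq d_{\theta_i(n)}\}} \right) \tag{3}$$

where $\Theta(n) = (\theta_1(n), \ldots, \theta_K(n))'$ and $\theta_i(n)$ is the $n$-th iterate of $\theta_i$, $n \ge 0$. Also, $1_A$ is the indicator
function that takes the value 1 if $A$ is true and 0 otherwise.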

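Definition 2.2

$$g_i(\Theta(n); N) = \begin{cases} 1 & \text{if } \dfrac{1}{N} \sum_{j=1}^{N} 1_{\{X_j \in V_{\theta_i(n)}\}} 1_{\{d_{X_j} = 1\}} > \dfrac{1}{N} \sum_{j=1}^{N} 1_{\{X_j \in V_{\theta_i(n)}\}} 1_{\{d_{X_j} = 2\}} \\ 2 & \text{otherwise} \end{cases} \tag{4}$$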
Remark 2.1 Note that $g_i(\Theta(n); N)$ above denotes the decision associated with the $i$-th cell according
to the majority vote rule.
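
The majority vote rule can be sketched in a few lines of Python. This is only an illustration under the assumption that classes are coded as 1 and 2 and that cells are formed with the squared Euclidean distance; the function name `cell_decisions` and the toy data are hypothetical.

```python
def cell_decisions(codebook, xs, ds):
    """Majority-vote decision for each Voronoi cell; ties and empty cells give class 2."""
    votes = [[0, 0] for _ in codebook]
    for x, d in zip(xs, ds):
        # Assign x to the cell of its nearest Voronoi vector.
        i = min(range(len(codebook)),
                key=lambda m: sum((c - v) ** 2 for c, v in zip(codebook[m], x)))
        votes[i][d - 1] += 1
    return [1 if v1 > v2 else 2 for v1, v2 in votes]


codebook = [[0.0, 0.0], [4.0, 4.0]]
xs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [4.0, 5.0], [5.0, 4.0]]
ds = [1, 1, 2, 2, 1]
decisions = cell_decisions(codebook, xs, ds)
```

In this toy example the first cell receives two votes for class 1 and one for class 2, while the second cell is tied and therefore falls into the "otherwise" branch.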

With the above definitions and assumptions, we can now write the following multi-pass combined
compression and classification algorithm for $\lambda \ge 0$:

1. Initialization: The algorithm is initialized with $\Theta(0)$, usually found by running a vector quantization
algorithm, e.g., the LBG algorithm [14], over the set of training vectors.

2. $n = 0$.

3. Assigning the training vectors to their respective cells: Find $i_l = \arg\min_m \|\theta_m(n) - X_l\|^2$, $l = 1, 2, \ldots, N$;
then $X_l$ belongs to $V_{\theta_{i_l}(n)}$.

4. Cell decisions: Calculate $g_i(\Theta(n); N)$, $i = 1, 2, \ldots, K$.

5. Updating the Voronoi vectors: For $i \in \{1, 2, \ldots, K\}$,

$$\theta_i(n+1) = \theta_i(n) + \epsilon_{n+1} \left( -\lambda\, 1_{\{X_{n+1} \in V_{\theta_i(n)}\}} + \gamma(d_{X_{n+1}}, g_i(\Theta(n); N), X_{n+1}, \Theta(n)) \right) \nabla_\theta \rho(\theta, X_{n+1}) \big|_{\theta = \theta_i(n)} \tag{5}$$

6. $n \leftarrow n + 1$.

7. If $n < N$, repeat Steps 3-6. If $n = N$, repeat Steps 3-4.

The above algorithm can be executed for multiple passes over the same training set (in case the size of
the training set is small) by using the values $\Theta(N)$ from the $m$-th pass to initialize the algorithm for pass
$m + 1$, until $m = M$, where $M$ is the maximum number of passes.
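
The following Python sketch shows one pass of Steps 3-6 for the squared Euclidean distance. It is a minimal illustration and not the authors' implementation: the function names, the step size held constant over the pass and the toy data are assumptions. The coefficient is $-\lambda - 1$ on the winning cell when the decisions agree and $-\lambda + 1$ when they differ.

```python
def grad_rho(theta, x):
    """Gradient in theta of rho(theta, x) = ||theta - x||^2."""
    return [2.0 * (t - v) for t, v in zip(theta, x)]


def assign(codebook, x):
    """Step 3: index of the nearest Voronoi vector."""
    return min(range(len(codebook)),
               key=lambda m: sum((c - v) ** 2 for c, v in zip(codebook[m], x)))


def majority(codebook, xs, ds):
    """Step 4: majority-vote decision of every cell (ties go to class 2)."""
    votes = [[0, 0] for _ in codebook]
    for x, d in zip(xs, ds):
        votes[assign(codebook, x)][d - 1] += 1
    return [1 if v1 > v2 else 2 for v1, v2 in votes]


def train_pass(codebook, xs, ds, lam, step):
    """One pass over the training set with a constant step size (assumption)."""
    for x, d in zip(xs, ds):
        g = majority(codebook, xs, ds)       # cell decisions for the current codebook
        i = assign(codebook, x)              # only the winning cell is updated
        gamma = -1.0 if d == g[i] else 1.0   # -1 if decisions agree, +1 otherwise
        coeff = step * (-lam + gamma)        # step size times the bracket of the update
        grad = grad_rho(codebook[i], x)
        codebook[i] = [t + coeff * gr for t, gr in zip(codebook[i], grad)]
    return majority(codebook, xs, ds)        # final Steps 3-4


codebook = [[0.0, 0.0], [3.0, 3.0]]
xs = [[0.2, 0.0], [0.0, 0.2], [3.0, 3.2], [3.2, 3.0]]
ds = [1, 1, 2, 2]
decisions = train_pass(codebook, xs, ds, lam=1.0, step=0.05)
```

Recomputing the cell decisions before every update follows the step list literally and costs on the order of $NK$ distance evaluations per update; for multiple passes, `train_pass` would simply be called again on the same `codebook`.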

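Remark 2.2 Note that Step 5, i.e., updating of the Voronoi vectors, can be written in the following
simplified manner:

If $X_{n+1} \in V_{\theta_i(n)}$, then

$$\theta_i(n+1) = \begin{cases} \theta_i(n) + \epsilon_{n+1} (-\lambda - 1) \nabla_\theta \rho(\theta, X_{n+1}) \big|_{\theta = \theta_i(n)} & \text{if } d_{X_{n+1}} = g_i(\Theta(n); N) \\ \theta_i(n) + \epsilon_{n+1} (-\lambda + 1) \nabla_\theta \rho(\theta, X_{n+1}) \big|_{\theta = \theta_i(n)} & \text{if } d_{X_{n+1}} \neq g_i(\Theta(n); N) \end{cases} \tag{6}$$

For $j \neq i$, $\theta_j(n+1) = \theta_j(n)$.

Remark 2.3 Note that for $\lambda = 0$, the above algorithm becomes the modified LVQ algorithm resulting
in better convergence properties, as reported in [12].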

2.1 Analysis of the combined compression and classification algorithm

In this subsection, we present the analysis of the above algorithm using the “mean ODE” method of [11].
Denote the vectors

$$h(\Theta(n)) = (h_1(\Theta(n)), \ldots, h_K(\Theta(n)))'$$

and

$$H(\Theta(n), X_{n+1}) = (H_1(\Theta(n), X_{n+1}), \ldots, H_K(\Theta(n), X_{n+1}))'$$
where

$$H_i(\Theta(n), X_{n+1}) = \left( -\lambda\, 1_{\{X_{n+1} \in V_{\theta_i(n)}\}} + \gamma(d_{X_{n+1}}, g_i(\Theta(n); N), X_{n+1}, \Theta(n)) \right) \nabla_\theta \rho(\theta, X_{n+1}) \big|_{\theta = \theta_i(n)} \tag{7}$$

and $h_i(\Theta(n))$, $i = 1, 2, \ldots, K$, is defined in Definition 2.4. Note that one can write the above algorithm
(5) in the following manner:

$$\Theta(n+1) = \Theta(n) + \epsilon_{n+1} H(\Theta(n), X_{n+1}), \quad n \ge 0 \tag{8}$$

Note that this is a special case of the general stochastic approximation algorithm of [11], quoted in
Section 2 of [12].

Define

$$p(x) = p_1(x)\pi_1 + p_2(x)\pi_2$$

$$q(x) = p_2(x)\pi_2 - p_1(x)\pi_1 \tag{9}$$

Due to the assumption that $\{X_l\}$, $l = 1, 2, \ldots$, is a sequence of i.i.d. random vectors and the
fact that they are distributed independently of $\Theta(l)$, the transition probability function $\Pi_{\Theta(n)}(A, X_n) =
P(X_{n+1} \in A \mid \mathcal{F}_n)$ is given by $p(A) = \int_A p(x)\,dx$, where $\mathcal{F}_n = \sigma\{\Theta(0), X_0, \ldots, \Theta(n), X_n\}$. This
makes the above algorithm a special case of the Robbins-Monro algorithm, with the transition probability
function being independent of $\Theta(n)$.

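Now, we introduce the following definitions:

Definition 2.3

$$\gamma_i(\Theta(n); N) = \operatorname{sign}\left( \sum_{j=1}^{N} 1_{\{X_j \in V_{\theta_i(n)}\}} \left( 1_{\{d_{X_j} = 2\}} - 1_{\{d_{X_j} = 1\}} \right) \right) \tag{10}$$

Remark 2.4 Note that $\gamma_i(\Theta(n); N) = 1$ if $g_i(\Theta(n); N) = 2$ and $-1$ otherwise.

Definition 2.4

$$h_i(\Theta) = -\int_{V_{\theta_i}} [\gamma_i(\Theta; N) q(x) + \lambda p(x)] \nabla_\theta \rho(\theta, x) \big|_{\theta = \theta_i}\, dx, \quad i = 1, 2, \ldots, K \tag{11}$$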


One can now prove the following Lemma:

Lemma 2.1

$$H_i(\Theta(n), X_{n+1}) = h_i(\Theta(n)) + \xi_i(n), \quad i = 1, 2, \ldots, K \tag{12}$$

where $\{\xi_i(n)\}$ is an $\mathcal{F}_n$-adapted martingale difference sequence such that

$$h_i(\Theta(n)) = E_a[H_i(\Theta(n), X_{n+1}) \mid \mathcal{F}_n], \quad \forall i \tag{13}$$
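
A short sketch of why the conditional mean of the update term has the form given in Definition 2.4 may be helpful. It treats the cell decision as fixed, writes $\gamma_i$ for the quantity of Remark 2.4 ($+1$ if the cell decision is 2, $-1$ otherwise), and uses the fact that, given $X = x$, the label equals 2 with probability $\pi_2 p_2(x)/p(x)$; these simplifications are assumptions of the sketch rather than statements from the original proof.

```latex
\begin{align*}
E[H_i(\Theta, X) \mid X = x]
  &= \Big( -\lambda - \gamma_i \frac{q(x)}{p(x)} \Big)
     \nabla_\theta \rho(\theta, x) \big|_{\theta = \theta_i},
     \qquad x \in V_{\theta_i}, \\
E[H_i(\Theta, X)]
  &= -\int_{V_{\theta_i}} \big[ \gamma_i\, q(x) + \lambda\, p(x) \big]
     \nabla_\theta \rho(\theta, x) \big|_{\theta = \theta_i}\, dx .
\end{align*}
```

For $x$ outside the cell $V_{\theta_i}$ the update term vanishes, so only the integral over the cell remains.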
Here, $E_a$ denotes expectation under $P_a$, where $P_a$ denotes the probability distribution for $\{X_n, \Theta(n)\}$, $n \ge 0$,
where $\Theta(0) = a$. Note that since $\{X_n\}$ is a sequence of i.i.d. random vectors, $P_a$ is independent of
$X_0 = x$.

We write the mean ODE associated with (8) as

$$\dot{\Theta} = h(\Theta), \quad \Theta(0) = a \tag{14}$$

where

$$h_i(\Theta) = \lim_{n \to \infty} E_a[H_i(\Theta, X_{n+1}) \mid \mathcal{F}_n] = \int H_i(\Theta, x) p(x)\,dx \tag{15}$$

since in this case $\{X_n\}$ is a sequence of i.i.d. random variables, where $P(X_{n+1} \in A \mid \mathcal{F}_n)$ is independent
of $\Theta(k)$, $k \le n$.

It is hard to establish a convergence result for general $h(\Theta)$, and often it is assumed that (14) has
an attractor $\Theta^*$, whose domain of attraction is given by $D^*$. If $Q$ is a compact subset of $D^*$ and
$\Theta(0) = a \in Q$, one can show that for any $\delta > 0$,

$$P\{\max_n \|\Theta(n) - \Theta(a, t_n)\| > \delta\} \le C(\alpha, Q) \sum_n \epsilon_n^{\alpha} \tag{16}$$

where $t_n = \sum_{i=1}^{n} \epsilon_i$ and $\Theta(a, t_n)$ is the solution to (14) for $t = t_n$, and $C(\alpha, Q)$ is a constant dependent
on $\alpha$ and $Q$ (see Theorem 4, page 45, [11]). Here, obviously, we have assumed Assumption 2.1.

One could also derive the following corollary (see Corollary 6, page 46, [11]), which says that under
the assumptions for which (16) holds, for the set of trajectories $\{\Theta(n)\}$ that visit $Q$ infinitely often, we have

$$\Theta(n) \to \Theta^*, \quad P_a\text{-a.s.} \tag{17}$$

$$P\{\limsup_{n \to \infty} \|\Theta(n) - \Theta(a, t_n)\| > \delta\} = 0 \tag{18}$$

However, there is no general theory which gives conditions under which
$P(\Theta(n) \in Q \text{ infinitely often}) = 1$ is satisfied [11].

Note that for a complete theory, it is essential to prove that the desired points of convergence $\Theta^*$ are
indeed the stable equilibrium points of (14). One way to do this is to find a potential function $J(\Theta)$, if it
exists, such that $h_i(\Theta) = -\nabla_{\theta_i} J(\Theta)$. Then one can apply results from Lyapunov stability to establish
results for stable equilibrium by studying the local minima of $J(\cdot)$ and their domains of attraction.
Although we refrain from such pursuits for the time being, we do notice that (see [12]) as $N \to \infty$,
$\gamma_i(\Theta; N) \to \operatorname{sign}(\int_{V_{\theta_i}} q(x)\,dx)$, and using the mean value theorem when the size of each Voronoi cell is
small, one can write that $h_i(\Theta)$ is approximately equal to

$$h_i(\Theta) \approx -\int_{V_{\theta_i}} \nabla_\theta \rho(\theta, x) \big|_{\theta = \theta_i} (|q(x)| + \lambda p(x))\,dx \tag{19}$$

which is the negative gradient of the cost function

$$J(\Theta) = \sum_{i=1}^{K} \int_{V_{\theta_i}} \rho(\theta_i, x) (|q(x)| + \lambda p(x))\,dx \tag{20}$$

For those readers who are more oriented towards intuitive reasoning, we comment here that this
was indeed the inspiration for the combined compression and classification algorithm given
above. The reason for this intuition is that, as indicated in [12], for the LVQ algorithm, the first part of
the integrand in (20) converges to the optimal Bayes cost when the number of Voronoi vectors tends to
infinity. Details of this analysis can be found in [10]. The second part is clearly the average distortion.


2.2 Convergence analysis of the combined compression and classification algorithm


The convergence analysis for a class of learning vector quantization algorithms was presented in [12],
following the analysis in [11] (see Part II, Chapter 1). However, since, as noted before, the algorithm
under investigation is a special case of the Robbins-Monro algorithm, where the transition probability
function is independent of $\Theta$, we can greatly simplify the set of assumptions needed. In particular, the
assumptions described as A.4 in [11], p. 216, become trivial and follow from the single assumption
that $h(\Theta)$ is locally Lipschitz. In this subsection, we obtain an upper bound on the $L^q$ estimate of a
“fluctuation” term to be introduced shortly, for $q > 2$. We will provide a simpler local bound later on for
$q = 2$.

Consider again the algorithm:

$$\Theta(n+1) = \Theta(n) + \epsilon_{n+1} H(\Theta(n), X_{n+1}), \quad n \ge 0 \tag{21}$$


Before we introduce the set of assumptions needed for the analysis of our algorithm, for the purpose 
of this section, let us introduce the following notations: 

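Notation 2.1

1. $D$ is an open subset of $\mathbb{R}^d$. $Q$ is a compact subset of $D$.

2. $\phi$ is a $C^2$ function from $\mathbb{R}^d$ to $\mathbb{R}$ with bounded second derivatives, where

$$M_0(Q) = \sup_{\Theta \in Q} |\phi(\Theta)|, \quad M_1(Q) = \sup_{\Theta \in Q} |\phi'(\Theta)|, \quad M_2(Q) = \sup_{\Theta \in Q} |\phi''(\Theta)|, \quad M_2 = \sup_{\Theta \in \mathbb{R}^d} |\phi''(\Theta)| \tag{22}$$

3. There exists an $R(\phi, \Theta, \Theta')$ such that

$$R(\phi, \Theta, \Theta') = \phi(\Theta') - \phi(\Theta) - \langle (\Theta' - \Theta), \phi'(\Theta) \rangle$$

$$|R(\phi, \Theta, \Theta')| \le M_2 |\Theta' - \Theta|^2, \quad \forall\, \Theta, \Theta' \in \mathbb{R}^d \tag{23}$$

4.

$$e_n(\phi) = \phi(\Theta(n+1)) - \phi(\Theta(n)) - \epsilon_{n+1} \langle \phi'(\Theta(n)), h(\Theta(n)) \rangle \tag{24}$$

5. For $\epsilon > 0$,

$$\tau(Q) = \inf(n;\ \Theta(n) \notin Q)$$

$$\sigma(\epsilon) = \inf(n \ge 1;\ |\Theta(n) - \Theta(n-1)| > \epsilon)$$

$$\nu(\epsilon, Q) = \inf(\tau(Q), \sigma(\epsilon)) \tag{25}$$

6. With $t_0 = 0$, $t_n = \sum_{i=1}^{n} \epsilon_i$, we define $m(n, T) = \inf\{k : k \ge n,\ \sum_{i=n}^{k} \epsilon_{i+1} \ge T\}$.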
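
To make the index $m(n, T)$ concrete, the small Python sketch below computes it for a given step-size sequence. It assumes the reading that $m(n, T)$ is the smallest $k \ge n$ for which $\epsilon_{n+1} + \cdots + \epsilon_{k+1} \ge T$; the function name `m_index`, the search cap `k_max` and the example sequence $\epsilon_i = 1/i$ are illustrative choices.

```python
def m_index(eps, n, T, k_max=10**6):
    """Smallest k >= n such that eps(n+1) + ... + eps(k+1) >= T.

    eps is a function mapping the index i >= 1 to the step size eps_i.
    """
    total = 0.0
    for k in range(n, k_max):
        total += eps(k + 1)
        if total >= T:
            return k
    raise ValueError("T was not reached within k_max steps")


def harmonic(i):
    """Example step sizes eps_i = 1/i, which sum to infinity."""
    return 1.0 / i


first = m_index(harmonic, 0, 1.5)   # 1 + 1/2 reaches 1.5 at k = 1
later = m_index(harmonic, 1, 1.0)   # 1/2 + 1/3 + 1/4 exceeds 1.0 at k = 3
```

Because the step sizes decrease, the same time horizon $T$ spans more and more iterations as $n$ grows, which is what the sums over $n < k \le m(n, T)$ in the bounds below account for.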

Suppose Assumption 2.1 holds. Also, let us make the following additional assumptions that will be 
sufficient for our analysis: 

Assumption 2.5 For any compact subset $Q$ of $D$, there exist constants $C_1$, $r_1$ such that, for all $\Theta \in Q$,

$$|H(\Theta, x)| \le C_1 (1 + |x|^{r_1}) \tag{26}$$


Remark 2.5 Note that for our choice of $H(\Theta, x)$ described in the previous section, (26) is satisfied from
Assumption 2.4.

Assumption 2.6 $h(\Theta) = (h_1(\Theta), \ldots, h_K(\Theta))'$, where $h_i(\Theta)$ is given by (13), is locally Lipschitz.

Remark 2.6 Note that this assumption itself is enough in our analysis and we do not need the assumptions
made in [12] following [11] (Assumption (A.4), p. 216), since they trivially follow from Assumption
2.6.

Assumption 2.7 For any $q > 1$, $\exists$ a constant $M < \infty$ such that

$$\sup_n E\{|X_n|^q I(n < \nu(\epsilon, Q))\} \le M \tag{27}$$

Remark 2.7 Since $\{X_n\}$ is a sequence of i.i.d. random vectors, one can simply write (27) as

$$\int_{\mathbb{R}^d} |x|^q p(dx) \le M \tag{28}$$

Remark 2.8 One can in fact deduce from Assumptions 2.5 and 2.7, under certain other restrictions
on the distribution function $p(dx)$, that Assumption 2.6 holds, since in this case $p(dx)$ is independent of
$\Theta$ (see Section 2.2.6, pp. 264-265 of [11]).

We can now present the following theorem: 

Theorem 2.1 Consider the update equation (21). Consider also (24), (25). Suppose Assumptions 2.1,
2.5, 2.6, 2.7 hold. Then, for any regular function $\phi$ with bounded second derivatives satisfying (22), any
compact subset $Q$ of $D$, and for all $q > 2$, there exist constants $B(q)$, $M_4(q)$, $\epsilon_0 > 0$ such that for all $\epsilon \le \epsilon_0$,
$T > 0$, $a \in D$, we have

$$E_a\Big\{ \sup_{n < k \le m(n,T)} I(k \le \nu(\epsilon, Q)) \Big| \sum_{i=n}^{k-1} e_i(\phi) \Big|^q \Big\} \le B(q) M_1(Q)\, T^{\frac{q}{2} - 1} \sum_{i=n+1}^{m(n,T)} \epsilon_i^{1 + \frac{q}{2}} + M_4(q)\, T^{q-1} \sum_{i=n+1}^{m(n,T)} \epsilon_i^{1+q} \tag{29}$$
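Proof: In this proof, $C_1(q)$, $C_2(q)$, $C_3(q)$, $C_4(q)$, $B(q)$, $M_4(q)$ denote constants dependent only on $q$.
From (24), (23) and (21), one can write

$$e_k(\phi) = \epsilon_{k+1} \langle \phi'(\Theta(k)), (H(\Theta(k), X_{k+1}) - h(\Theta(k))) \rangle + R(\phi, \Theta(k), \Theta(k+1)) = e_k^{(1)} + e_k^{(2)} \tag{30}$$

where

$$e_k^{(1)} = \epsilon_{k+1} \langle \phi'(\Theta(k)), (H(\Theta(k), X_{k+1}) - h(\Theta(k))) \rangle$$

$$e_k^{(2)} = R(\phi, \Theta(k), \Theta(k+1))$$

Note that we have

$$\Big| \sum_{i=n}^{k-1} e_i(\phi) \Big|^q = \Big| \sum_{i=n}^{k-1} e_i^{(1)} + \sum_{i=n}^{k-1} e_i^{(2)} \Big|^q \le \Big[ \Big| \sum_{i=n}^{k-1} e_i^{(1)} \Big| + \Big| \sum_{i=n}^{k-1} e_i^{(2)} \Big| \Big]^q \le 2^{q-1} \Big[ \Big| \sum_{i=n}^{k-1} e_i^{(1)} \Big|^q + \Big| \sum_{i=n}^{k-1} e_i^{(2)} \Big|^q \Big] \tag{31}$$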

From now on, we write $m$ for $m(n, T)$ and $\nu$ for $\nu(\epsilon, Q)$ for notational simplicity. We write
$S_1 = E\{\sup_{n < k \le m} I(k \le \nu) |\sum_{i=n}^{k-1} e_i^{(1)}|^q\} \le E\{\sup_{n < k \le m} |\sum_{i=n}^{k-1} U_i|^q\}$, where

$$U_i = \epsilon_{i+1} \langle \phi'(\Theta(i)), (H(\Theta(i), X_{i+1}) - h(\Theta(i))) \rangle I(i + 1 \le \nu)$$

Denoting $V_i = \langle \phi'(\Theta(i)), (H(\Theta(i), X_{i+1}) - h(\Theta(i))) \rangle I(i + 1 \le \nu)$, we have $U_i = \epsilon_{i+1} V_i$.

We notice that from (13), $E(U_i \mid \mathcal{F}_i) = 0$.

We also observe that from (22)

$$E|V_i|^q \le M_1(Q) C_1(q) [E|H(\Theta(i), X_{i+1})|^q + E|h(\Theta(i))|^q] \le M_1(Q) C_2(q) E|H(\Theta(i), X_{i+1})|^q \tag{32}$$

The last inequality follows from Jensen’s inequality and (13).

One can now use Assumption 2.5 and Assumption 2.7 to obtain the following upper bound:

$$E|V_i|^q \le M_1(Q) C_3(q) \tag{33}$$

One can now apply Burkholder’s inequality (see Lemma 6, p. 294, [11]) to obtain

$$S_1 \le C_4(q) E\Big( \sum_{i=n}^{m-1} \epsilon_{i+1}^2 |V_i|^2 \Big)^{q/2} \tag{34}$$

For $q > 2$, one can further apply a result based on Hölder’s inequality (see Lemma 7, p. 294, [11]) to
obtain

$$S_1 \le C_4(q) \Big( \sum_{i=n}^{m-1} \epsilon_{i+1} \Big)^{\frac{q}{2} - 1} \sum_{i=n}^{m-1} \epsilon_{i+1}^{1 + \frac{q}{2}} E|V_i|^q \le B(q) M_1(Q)\, T^{\frac{q}{2} - 1} \sum_{i=n}^{m-1} \epsilon_{i+1}^{1 + \frac{q}{2}} \tag{35}$$




We prove the following bound on $S_2 = E\{(\sum_{i=n}^{m-1} |e_i^{(2)}| I(i + 1 \le \nu))^q\}$ using (23), (26) and Assumption
2.7:

$$S_2 \le M_4(q)\, T^{q-1} \sum_{i=n}^{m-1} \epsilon_{i+1}^{1+q} \tag{36}$$

Combining (35), (36), we obtain (29) from (31). □

Next, we present a theorem that gives an upper bound on the $L^q$ norm of the distance between the actual
iterate $\Theta(n)$ and $\Theta(a, t_n)$, which is the solution to (14) for $t = t_n$. In other words, this result gives an
upper bound on the quality of approximation by the mean trajectory represented by (14). We do not
provide the proof since the result holds under the same set of assumptions as the previous theorem and
the proof can be found in [11], p. 301.

Theorem 2.2 Consider the update equation (21) and (14). Suppose Assumptions 2.1, 2.5, 2.6, 2.7 hold.
Suppose $Q_1 \subset Q_2$ are compact subsets of $D$, and $q > 2$. Then there exist constants $B_1(q)$, $L_2$ ($L_2$ is the
Lipschitz constant for $h$ in $Q_2$), such that for all $T > 0$ (that satisfy the condition that for all $a \in Q_1$, all
$t \le T$, $d(\Theta(a, t), Q_2^c) \ge \delta_0 > 0$), all $\delta < \delta_0$, all $a \in Q_1$,

$$P_a\Big\{ \sup_{n \le m(0,T)} |\Theta(n) - \Theta(a, t_n)|^q \ge \delta^q \Big\} \le \frac{B_1(q)}{\delta^q} (1 + T)^{q-1} \exp(q L_2 T) \sum_{i=1}^{m(0,T)} \epsilon_i^{1 + \frac{q}{2}} \tag{37}$$

We now present an asymptotic result without proof that states that $\Theta(n)$ asymptotically converges to
a compact subset of $D$, based on the assumption that the mean ODE has a point of asymptotic stability
$\Theta^*$ in $D$ with domain of attraction $D$. We make more precise statements later. First, we introduce the
following additional assumptions and notations:

Assumption 2.8 There exists $\alpha$ such that $\sum_n \epsilon_n^{\alpha} < \infty$.

Assumption 2.9 There exists a positive function $U$ of class $C^2$ on $D$ such that $U(\Theta) \to C \le \infty$ if
$\Theta \to \partial D$ or $|\Theta| \to \infty$, and $U(\Theta) < C$ for $\Theta \in D$, satisfying

$$\langle U'(\Theta), h(\Theta) \rangle \le 0, \quad \forall\, \Theta \in D \tag{38}$$

Remark 2.9 Note that if there is such a point $\Theta^*$ in $D$ which is a point of asymptotic stability for the
mean ODE (14) with domain of attraction $D$, this means that any solution of (14) for $a \in D$ indefinitely
remains in $D$ and converges to $\Theta^*$ as $t \to \infty$. It can then be shown that (see [15], Th. 5.3, p. 31) there
exists a function $U(\Theta)$ which satisfies the conditions mentioned in Assumption 2.9.

Notation 2.2

$$K(c) = \{\Theta;\ U(\Theta) \le c\}$$

$$\tau(c) = \inf(n;\ \Theta(n) \notin K(c))$$

$$q_0(\alpha) = \sup(2, 2(\alpha - 1)) \tag{39}$$
With these notations and assumptions, we can present the following theorem (for a proof see [11],
pp. 301-304):

Theorem 2.3 Consider (21). Suppose Assumptions 2.1, 2.5, 2.6, 2.7, 2.8, 2.9 hold and suppose that $F$
is a compact set such that

$$F = \{\Theta;\ U(\Theta) \le c_0\} \supset \{\Theta;\ U'(\Theta) \cdot h(\Theta) = 0\}$$

for some $c_0 < C$, where $C$ is defined in Assumption 2.9. Then, for any compact subset $Q$ of $D$, and
$q > q_0(\alpha)$, there exists a constant $B_2(q)$ such that for all $a \in Q$:

$$P_a(\Theta(n) \text{ converges to } F) \ge 1 - B_2(q) \sum_{i \ge 1} \epsilon_i^{1 + \frac{q}{2}} \tag{40}$$

In the next subsection, we provide a simpler local bound for $q = 2$ following the analysis given in
Section 5.1 of Part II of [11].

2.3 A simpler local bound for q = 2 

Consider again the algorithm:

$$\Theta(n+1) = \Theta(n) + \epsilon_{n+1} H(\Theta(n), X_{n+1}), \quad n \ge 0 \tag{41}$$

Since $X_n$, $n \ge 0$, are distributed independently of $\Theta(n)$ and also $\{X_n\}$, $n \ge 0$, is a sequence of i.i.d.
random variables, we have the main or so-called Robbins-Monro assumption satisfied, namely,

$$E[g(\Theta(n), X_{n+1}) \mid \mathcal{F}_n] = \int_{\mathbb{R}^d} g(\Theta(n), x) p(x)\,dx \tag{42}$$

Note that we have already observed before in Lemma 2.1 that

$$h(\Theta(n)) = E_a[H(\Theta(n), X_{n+1}) \mid \mathcal{F}_n] = \int_{\mathbb{R}^d} H(\Theta(n), x) p(x)\,dx \tag{43}$$

Next, we introduce the two main assumptions of this section:

Assumption 2.10 For all $\Theta(0) = a \in \mathbb{R}^d$,

$$E_a[|H(\Theta(n), X_{n+1})|^2 \mid \mathcal{F}_n] \le C_1 (1 + |\Theta(n)|^2) \tag{44}$$

for some suitable constant $C_1$.

Remark 2.10 Note that this assumption guarantees the existence of $h(\Theta(n))$.

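Assumption 2.11 There exist $\Theta^*$ (which is a point of asymptotic stability of (14)) and a constant
$\delta > 0$ such that for all $\Theta$,

$$(\Theta - \Theta^*)' h(\Theta) \le -\delta |\Theta - \Theta^*|^2 \tag{45}$$

with, for some $\beta \le 1$,

$$\liminf_{n} \left( 2\delta \frac{\epsilon_n^{\beta}}{\epsilon_{n+1}^{\beta}} + \frac{\epsilon_{n+1}^{\beta} - \epsilon_n^{\beta}}{\epsilon_{n+1}^{\beta + 1}} \right) > 0 \tag{46}$$

Remark 2.11 Note that if $\epsilon_n = \frac{A}{n^{\alpha}}$, $0 < \alpha < 1$, then (46) holds for all $\beta \le 1$. It is true for $\alpha = 1$
if $2\delta > \frac{\beta}{A}$.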
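
The step-size condition can be explored numerically. The sketch below assumes that the condition reads $\liminf_n \big( 2\delta\, \epsilon_n^{\beta} / \epsilon_{n+1}^{\beta} + (\epsilon_{n+1}^{\beta} - \epsilon_n^{\beta}) / \epsilon_{n+1}^{\beta+1} \big) > 0$ and simply evaluates the bracket at one large index as a rough proxy for the limit; the function name `cond46` and the parameter values are illustrative assumptions.

```python
def cond46(eps, delta, beta, n):
    """Value of 2*delta*eps_n^beta/eps_{n+1}^beta + (eps_{n+1}^beta - eps_n^beta)/eps_{n+1}^(beta+1)."""
    a = eps(n) ** beta
    b = eps(n + 1) ** beta
    return 2.0 * delta * a / b + (b - a) / (b * eps(n + 1))


def slow(n):
    """eps_n = 1 / n^0.5, a slowly decreasing step size."""
    return n ** -0.5


def fast(n):
    """eps_n = 1 / n."""
    return 1.0 / n


n_large = 10 ** 6
slow_value = cond46(slow, 0.1, 1.0, n_large)        # close to 2 * delta = 0.2
fast_small_delta = cond46(fast, 0.1, 1.0, n_large)  # close to 2 * delta - 1 = -0.8
fast_large_delta = cond46(fast, 1.0, 1.0, n_large)  # close to 2 * delta - 1 = 1.0
```

For step sizes decaying like $1/n$ the second term tends to a negative constant, so the sign of the expression depends on $\delta$, whereas for slower decay the second term vanishes in the limit.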

We can now present the following theorem, which gives a simple local bound for the expected distance
between $\Theta(n)$ and $\Theta^*$:

Theorem 2.4 Consider (41). Suppose Assumptions 2.10, 2.11 hold. Then,

$$E_a(|\Theta(n) - \Theta^*|^2) \le B_5(a)\, \epsilon_n^{\beta} \tag{47}$$

for some suitable constant $B_5(a)$.

Proof: It is sufficient to show that for some suitable $n_0$, there exists a $B_5(a, n_0)$ such that for all $n \ge n_0$,

$$E_a(|\Theta(n) - \Theta^*|^2) \le B_5(a, n_0)\, \epsilon_n^{\beta} \tag{48}$$

Writing $J_n = \Theta(n) - \Theta^*$, we have

$$E_a(|J_{n+1}|^2 \mid \mathcal{F}_n) = |J_n|^2 + 2\epsilon_{n+1} \langle J_n, h(\Theta(n)) \rangle + \epsilon_{n+1}^2 E_a[|H(\Theta(n), X_{n+1})|^2 \mid \mathcal{F}_n] \tag{49}$$

Suppose that $n$ is sufficiently large such that $1 > 2\epsilon_{n+1}\delta$. Then, by taking expectations, we have

$$E_a|J_{n+1}|^2 \le (1 - 2\epsilon_{n+1}\delta + \tilde{C}_1 \epsilon_{n+1}^2) E_a|J_n|^2 + \tilde{C}_1 \epsilon_{n+1}^2 \tag{50}$$

where $\tilde{C}_1$ is a constant such that

$$C_1 (1 + |\Theta|^2) \le \tilde{C}_1 (1 + |\Theta - \Theta^*|^2) \tag{51}$$

Now, one can use the following result, which can be proved directly from (46). There exist $B^0$ and $n_0$
such that for all $B_5 \ge B^0$ and $n \ge n_0$, the sequence $u_n = B_5\, \epsilon_n^{\beta}$ satisfies

$$u_{n+1} \ge (1 - 2\epsilon_{n+1}\delta + \tilde{C}_1 \epsilon_{n+1}^2) u_n + \tilde{C}_1 \epsilon_{n+1}^2 \tag{52}$$

Choose $B_5(a, n_0) \ge B^0$ such that

$$E_a|J_{n_0}|^2 \le B_5(a, n_0)\, \epsilon_{n_0}^{\beta}$$

It follows immediately by induction on $n$ that the sequence $u_n = B_5(a, n_0)\, \epsilon_n^{\beta}$, $n \ge n_0$, satisfies
$E_a|J_n|^2 \le u_n$, from which (47) follows. □


3 Simulation Studies 

In this section, we present some simulation results illustrating the compression performance of our
algorithm while a trade-off is obtained with respect to its classification performance. We consider two
examples, one with computer simulated data distributed according to either of two bimodal Gaussian
densities and the other with “mel-cepstral” coefficients of two female speakers obtained from their speech.
Bimodal Gaussian data


This part of the simulation study is carried out with computer generated random numbers distributed
according to either of two two-dimensional bimodal Gaussian distributions. The first pattern is generated
from the bimodal Gaussian density $0.5\,N([1.0\ 1.0]', I) + 0.5\,N([-1.0\ -1.0]', I)$, where $N([m_1\ m_2]', \Sigma)$
is the two-dimensional normal distribution function with mean vector $[m_1\ m_2]'$ and covariance matrix
$\Sigma$. The second pattern is generated from the density $0.4\,N([0.0\ 0.0]', 4I) + 0.6\,N([0.5\ 0.5]', 4I)$. The
training set was formed by 500 vectors from each pattern (meaning $\pi_1 = \pi_2 = 0.5$). This set was
used to train the Voronoi vectors in multiple passes, the total number of passes being 20. The number
of Voronoi vectors that would result in a good classification performance was found by increasing the
number of Voronoi vectors by one until the classification performance (for a given size of test data set)
reached a floor. Thus 16 Voronoi cells were chosen and their centroids initialized by the output of an
LBG algorithm processing the training data. Each test data set contained 1000 vectors
from pattern 1 and pattern 2, in proportions that satisfied the a priori probabilities. The learning rate $\epsilon_n$ was
kept fixed over each pass at a value $\epsilon_p$, where $p$ denotes the number of the pass, with $\epsilon_1 = 0.01$. The
compression performance averaged over 10 test data sets for a range of $\lambda \in [0.0, 5.0]$ is given in Figure
1. The compression error was measured by the mean square error, that is, the average of the
squared distances between the test vectors and their representative Voronoi vectors, normalized with
respect to the compression error achieved by the pure LVQ algorithm ($\lambda = 0.0$). It is seen that as $\lambda$
increases up to 5.0, there is a reduction of approximately 3.5% in the normalized compression error.

The classification performance, measured by the percentage of misclassified data, did not change very
much with increasing value of $\lambda$ and tended to hover around 30% in the range of $\lambda$ mentioned above.
Hence we did not include a separate plot for the classification performance.
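
A data set of the kind described above can be generated with the following Python sketch, which also includes the mean square error used as the compression measure. It is only an illustration: the covariance matrices are read as $I$ and $4I$ (standard deviations 1 and 2 per coordinate), and the function names and the seed are hypothetical.

```python
import random


def sample_mixture(rng, weight_first, means, std):
    """Draw one 2-D vector from a two-component isotropic Gaussian mixture."""
    mean = means[0] if rng.random() < weight_first else means[1]
    return [rng.gauss(mean[0], std), rng.gauss(mean[1], std)]


def make_training_set(n_per_class=500, seed=0):
    """500 vectors per pattern, so that both a priori probabilities equal 0.5."""
    rng = random.Random(seed)
    xs, ds = [], []
    for _ in range(n_per_class):
        # Pattern 1: 0.5 N([1, 1]', I) + 0.5 N([-1, -1]', I)
        xs.append(sample_mixture(rng, 0.5, ([1.0, 1.0], [-1.0, -1.0]), 1.0))
        ds.append(1)
        # Pattern 2: 0.4 N([0, 0]', 4I) + 0.6 N([0.5, 0.5]', 4I), i.e. std = 2
        xs.append(sample_mixture(rng, 0.4, ([0.0, 0.0], [0.5, 0.5]), 2.0))
        ds.append(2)
    return xs, ds


def mean_square_error(codebook, xs):
    """Average squared distance between each vector and its nearest Voronoi vector."""
    return sum(min(sum((c - v) ** 2 for c, v in zip(cb, x)) for cb in codebook)
               for x in xs) / len(xs)


xs, ds = make_training_set()
toy_error = mean_square_error([[0.0, 0.0]], [[1.0, 0.0], [0.0, 2.0]])
```

The normalized compression error reported in the text would then be the ratio of `mean_square_error` for a codebook trained with a given $\lambda$ to that of the codebook trained with $\lambda = 0$, both evaluated on the same test set.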

Mel-Cepstral coefficients of 2 speakers 

This example is based on “mel-cepstrum” coefficients of two female speakers. “Mel-cepstrum” features,
based on the nonlinear human perception of the frequency of sounds, have been well studied and successfully
applied to speaker identification problems. These studies have shown that the mel-cepstrum
can effectively extract the vocal tract shape information of the speakers and yield good distinguishing
performance [16] [17]. In our example, the labeled phonetic speech data of the two female speakers are
extracted from the TIMIT database for dialect region 2. The speech waveform is segmented into 16 ms
frames overlapped by 8 ms and parameterized into 14-dimensional mel-cepstrum vectors to establish the
feature space.
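
The segmentation into overlapping frames can be sketched as follows. The 16 kHz sampling rate is an assumption about the TIMIT recordings, the function name `frame_signal` is hypothetical, and the mel-cepstrum computation itself is not shown.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=16, hop_ms=8):
    """Split a waveform into frames of frame_ms that start every hop_ms."""
    frame_len = sample_rate * frame_ms // 1000   # 256 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000           # 128 samples, i.e. 8 ms overlap
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


# A dummy waveform of 1024 samples stands in for real speech.
frames = frame_signal(list(range(1024)))
```

Each frame would then be mapped to one 14-dimensional feature vector, so that a speaker's utterances yield a sequence of labeled training vectors.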

Since the performance of an LVQ type algorithm depends critically on the number of Voronoi vectors
and the number of training vectors per Voronoi cell, the following parameters were chosen to achieve a
trade-off with the computational time required. The training set was randomly chosen to have 500 data
vectors from each speaker. The number of Voronoi cells was chosen to be 20. The training set was used
to update the Voronoi vectors in multiple passes, the total number of passes being 30. The learning rate
$\epsilon_n$ was taken to be constant over each pass at a value $\epsilon_p$, where $p$ denotes the number of the pass, with
$\epsilon_1 = 0.04$. The Voronoi vectors were initialized by passing the training set through an LBG algorithm.
Once the training was completed, 5 sets of test data, each containing 250 vectors taken randomly from
the database for both speakers, were used to obtain the compression and classification performances
of our algorithm. Figures 2 and 3 illustrate the results averaged over the 5 test data sets, for a range of
$\lambda \in [0.0, 5.0]$. As expected, the compression error (measured by the mean square distance between the
data and its representative Voronoi vector), which was normalized with respect to the error obtained by
the pure LVQ algorithm ($\lambda = 0.0$), decreases by approximately 7%, whereas the classification error goes
up by 4.5%. We would like to comment here that the classification error can be further reduced by choosing
a larger number of Voronoi cells, which would obviously require a larger number of training vectors.

4 Conclusions and future research 

We have developed an algorithm based on learning vector quantization (LVQ) for combined compression
and classification. We have shown convergence of the algorithm, under reasonable conditions, using the
ODE method of stochastic approximation. We have also illustrated the performance of the algorithm
with some examples. The sensitivity of the performance of the algorithm with respect to the weight
parameter $\lambda$ indicates that the compression error decreases with increasing $\lambda$, whereas the increase in
classification error is relatively insignificant.

An important future research problem that we are currently working on is the extension of the
algorithm when the VQ is replaced by TSVQ. In this extension, we use and extend the methods and
analysis of [18]. With this extension, we will be able to treat the performance of the WTSVQ algorithm
of [4] [5] [6], [7] analytically, including compression of the wavelet coefficients.
Figure 1: Compression error performance of the combined LVQ algorithm for bimodal Gaussian patterns (plot of compression error versus $\lambda$)


Figure 2: Compression error performance of the combined LVQ algorithm for “mel-cepstrum” features of
female speakers (plot of compression error versus $\lambda$)
Figure 3: Classification error performance of the combined LVQ algorithm for “mel-cepstrum” features of
female speakers (plot of classification error versus $\lambda$)